A Semi-Automatic Approach of old Arabic Documents Indexing
Authors
Abstract
Indexing is a widely used technique in retrieval systems. Its goal is to extract and represent the meaning of a document so that users can find it. There are two types of indexing: manual and automatic. Automatic indexing relies on character and word recognition engines, which work only on the text of contemporary documents. In this paper, we propose a semi-automatic approach to indexing and searching images of old Arabic documents without recognizing their contents, in order to work around the inability of recognition techniques to interpret old documents. The proposed approach represents each document by the structural features of its indexes, which an expert selects manually from the document. The approach was tested on a sample of approximately 1100 envelopes and shows good results.
Keywords: indexing, old documents, structural features, document analysis
Related resources
Modelspace - Cooperative Document Information Extraction in Flexible Hierarchies
Business document indexing for ordered filing of documents is a crucial task for every company. Since this is tedious, error-prone work, automatic or at least semi-automatic approaches have a high value. One approach for semi-automated indexing of business documents uses self-learning information extraction methods based on user feedback. While these methods require no management of complex in...
Semi-Automatic Indexing of Multilingual Documents
With the growing significance of digital libraries and the Internet, more and more electronic texts become accessible to a wide and geographically dispersed public. This requires adequate tools to facilitate indexing, storage, and retrieval of documents written in different languages. We present a method for semi-automatic indexing of electronic documents and construction of a multilingual thesa...
A semi-automatic indexing system based on embedded information in HTML documents
Purpose – This paper describes and evaluates the tool DigiDoc MetaEdit which allows the semi-automatic indexing of HTML documents. The tool works by identifying and suggesting keywords from a thesaurus according to the embedded information in HTML documents. This enables the parameterization of keyword assignment based on how frequently the terms appear in the document, the relevance of their p...
Digital Learning for Summarizing Arabic Documents
We present in this paper an automatic summarization method of Arabic documents. This method is based on a numerical approach which uses a semi-supervised learning technique. The proposed method consists of two phases. The first one is the learning phase and the second is the use phase. The learning phase is based on the Support Vector Machine (SVM) algorithm. In order to evaluate our method, we...
Indexation des documents XML : Un DataGuide annoté avec un index de contenu
Indexing in classical information retrieval offers few tools for handling semi-structured documents: document representations in information retrieval were conceived for flat, homogeneous documents. They are not suited to the simultaneous treatment of structure and content. Several approaches to indexing semi-structured data have been proposed to resolve this new chal...